Measuring Intelligence


Generating Correct Answers for Progressive Matrices Intelligence Tests

Neural Information Processing Systems

Raven's Progressive Matrices are multiple-choice intelligence tests, where one tries to complete the missing location in a 3×3 grid of abstract images. Previous attempts to address this test have focused solely on selecting the right answer out of the multiple choices. In this work, we focus, instead, on generating a correct answer given the grid, without seeing the choices, which is a harder task, by definition. The proposed neural model combines multiple advances in generative models, including employing multiple pathways through the same network, using the reparameterization trick along two pathways to make their encoding compatible, a dynamic application of variational losses, and a complex perceptual loss that is coupled with a selective backpropagation procedure. Our algorithm is able not only to generate a set of plausible answers, but also to be competitive with state-of-the-art methods in multiple-choice tests.
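
To make the reparameterization idea concrete, here is a minimal sketch, assuming a PyTorch-style setup: two pathway encodings are each mapped to Gaussian parameters, sampled with the reparameterization trick, and regularized toward a shared prior so the two latent codes stay compatible. This illustrates the technique named in the abstract, not the authors' implementation; all module and variable names are assumptions.

```python
# Minimal sketch (not the paper's implementation): two encoder pathways
# produce Gaussian parameters, and the reparameterization trick lets us
# sample latent codes while keeping gradients flowing through both paths.
import torch
import torch.nn as nn

class PathwayEncoder(nn.Module):
    def __init__(self, in_dim=512, z_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)       # predicted mean
        self.logvar = nn.Linear(in_dim, z_dim)   # predicted log-variance

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)     # reparameterization trick
        return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    # KL(q(z|x) || N(0, I)), the usual variational regularizer
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# Two pathways through (conceptually) the same network; regularizing both
# toward the same prior is one way to make their encodings compatible.
enc = PathwayEncoder()
h_context, h_answer = torch.randn(8, 512), torch.randn(8, 512)
z_c, mu_c, lv_c = enc(h_context)
z_a, mu_a, lv_a = enc(h_answer)
loss_var = kl_to_standard_normal(mu_c, lv_c) + kl_to_standard_normal(mu_a, lv_a)
```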


Some things to know about achieving artificial general intelligence

arXiv.org Artificial Intelligence

Current and foreseeable GenAI models are not capable of achieving artificial general intelligence because they are burdened with anthropogenic debt. They depend heavily on human input to provide well-structured problems, architecture, and training data. They cast every problem as a language pattern learning problem and are thus not capable of the kind of autonomy needed to achieve artificial general intelligence. Current models succeed at their tasks because people solve most of the problems to which these models are directed, leaving only simple computations for the model to perform, such as gradient descent. Another barrier is the need to recognize that there are multiple kinds of problems, some of which cannot be solved by available computational methods (for example, "insight problems"). Current methods for evaluating models (benchmarks and tests) are not adequate to identify the generality of the solutions, because it is impossible to infer the means by which a problem was solved from the fact of its solution. A test could be passed, for example, by a test-specific or a test-general method. It is a logical fallacy (affirming the consequent) to infer a method of solution from the observation of success.
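
As a concrete illustration of the "simple computations" the abstract attributes to the model, a single gradient-descent step is just an arithmetic update; the problem framing, architecture, and data that make it useful are supplied by people. The toy example below is illustrative only and not taken from the paper.

```python
# A single gradient-descent step on a toy quadratic loss: the update itself is
# simple arithmetic; everything that makes it useful is supplied by humans.
def gradient_descent_step(w, grad, lr=0.1):
    return w - lr * grad

loss = lambda w: (w - 2.0) ** 2          # toy loss with its minimum at w = 2
w = 5.0                                  # current parameter value
grad = 2.0 * (w - 2.0)                   # analytic gradient of the toy loss
w_new = gradient_descent_step(w, grad)
print(w, loss(w), "->", w_new, loss(w_new))  # loss drops from 9.0 toward ~5.76
```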


MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

arXiv.org Artificial Intelligence

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.
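
For orientation, comparing a model's accuracy against the random-chance baseline on multiple-choice items takes only a few lines; the 25% baseline quoted above corresponds to four answer options per item. The snippet below is a generic illustration, not the MM-IQ evaluation code, and the function and variable names are assumptions.

```python
# Generic accuracy-vs-chance comparison for a multiple-choice benchmark
# (illustrative only; not the MM-IQ evaluation code).
def accuracy(predictions, answers):
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def chance_baseline(num_options=4):
    return 1.0 / num_options          # 4 options -> 25% expected by guessing

preds = ["B", "C", "A", "D", "B", "A", "C", "C"]
gold  = ["B", "A", "A", "D", "C", "A", "B", "C"]
print(f"model accuracy: {accuracy(preds, gold):.2%}, chance: {chance_baseline():.2%}")
```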


Review for NeurIPS paper: Generating Correct Answers for Progressive Matrices Intelligence Tests

Neural Information Processing Systems

I have read the reviews and the author response, and I have also asked an expert AC to provide a comment in lieu of a 4th reviewer (pasted below for reference). Taking all of these together, I will recommend acceptance, with a note. NOTE TO AUTHORS: This work is going to be the reference paper for using generation as opposed to discrimination. As such, it is really crucial to set the right path for evaluating models in a fair and rigorous way, so that research that follows builds on a solid base. The presented evaluation has some issues (see points below).


Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI

arXiv.org Artificial Intelligence

OpenAI's o3 achieves a high score of 87.5% on ARC-AGI, a benchmark proposed to measure intelligence. This raises the question of whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI). Building on the distinction between skills and intelligence made by François Chollet, the creator of ARC-AGI, a new understanding of intelligence is introduced: an agent is more intelligent the more efficiently it can achieve more diverse goals in more diverse worlds with less knowledge. An analysis of the ARC-AGI benchmark shows that its tasks represent a very specific type of problem that can be solved by massive trialling of combinations of predefined operations. This method is also applied by o3, which achieves its high score through the extensive use of computing power. However, for most problems in the physical world and in the human domain, solutions cannot be tested in advance and predefined operations are not available. Consequently, massive trialling of predefined operations, as o3 does, cannot be a basis for AGI; instead, new approaches are required that can reliably solve a wide variety of problems without existing skills. To support this development, a new benchmark for intelligence is outlined that covers a much higher diversity of unknown tasks to be solved, thus enabling a comprehensive assessment of intelligence and of progress towards AGI.
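
The "massive trialling of combinations of predefined operations" described above can be pictured as a brute-force search over compositions of primitive grid transforms, accepting the first composition that reproduces all training examples. The sketch below is a deliberately simplified illustration of that strategy, not o3's method or the actual ARC-AGI task format; the primitive set and helper names are assumptions.

```python
# Simplified illustration of "massive trialling": search over compositions of
# predefined grid operations until one maps every input example to its output.
# Not o3's method; just the general brute-force pattern the abstract describes.
from itertools import product

def flip_h(g):    return [row[::-1] for row in g]
def flip_v(g):    return g[::-1]
def transpose(g): return [list(r) for r in zip(*g)]
def identity(g):  return g

PRIMITIVES = [identity, flip_h, flip_v, transpose]

def search_program(examples, max_depth=3):
    """Try every sequence of primitives up to max_depth and return the first
    sequence that is consistent with all (input, output) example pairs."""
    for depth in range(1, max_depth + 1):
        for ops in product(PRIMITIVES, repeat=depth):
            def apply_ops(g, ops=ops):
                for op in ops:
                    g = op(g)
                return g
            if all(apply_ops(x) == y for x, y in examples):
                return [op.__name__ for op in ops]
    return None

examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]   # output is a horizontal flip
print(search_program(examples))                      # -> ['flip_h']
```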


Using Fitts' Law to Benchmark Assisted Human-Robot Performance

arXiv.org Artificial Intelligence

Shared control systems aim to combine human and robot abilities to improve task performance. However, achieving optimal performance requires that the robot's level of assistance adjusts the operator's cognitive workload in response to the task difficulty. Understanding and dynamically adjusting this balance is crucial to maximizing efficiency and user satisfaction. In this paper, we propose a novel benchmarking method for shared control systems based on Fitts' Law to formally parameterize the difficulty level of a target-reaching task. With this, we systematically quantify and model the effect of task difficulty (i.e., the size and distance of the target) and robot autonomy on task performance and on operators' cognitive load and trust levels. Our empirical results (N=24) not only show that both task difficulty and robot autonomy influence task performance, but also that the performance can be modelled using these parameters, which may allow for the generalization of this relationship across more diverse setups. We also found that the users' perceived cognitive load and trust were influenced by these factors. Given the challenges in directly measuring cognitive load in real time, our adapted Fitts' model presents a potential alternative approach to estimating cognitive load by determining the difficulty level of the task, under the assumption that greater task difficulty results in higher cognitive load levels. We hope that these insights and our proposed framework inspire future work to further investigate the generalizability of the method, ultimately enabling the benchmarking and systematic assessment of shared control quality and user impact, which will aid in the development of more effective and adaptable systems.
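
Fitts' Law parameterizes difficulty by target distance and width; in the common Shannon formulation the index of difficulty is ID = log2(D/W + 1) and predicted movement time is MT = a + b·ID. The sketch below shows how such an index might be computed and fitted to timing data; the choice of formulation, the data, and the coefficients are assumptions for illustration, not values from the paper.

```python
# Sketch of Fitts' Law difficulty and a least-squares fit MT = a + b * ID
# (Shannon formulation assumed; the trial data are illustrative).
import math
import numpy as np

def index_of_difficulty(distance, width):
    """Shannon formulation: ID = log2(D / W + 1), in bits."""
    return math.log2(distance / width + 1)

# Hypothetical trials: (target distance, target width, measured movement time in s)
trials = [(0.10, 0.02, 0.62), (0.20, 0.02, 0.81), (0.20, 0.01, 0.95), (0.40, 0.01, 1.20)]
ids = np.array([index_of_difficulty(d, w) for d, w, _ in trials])
mts = np.array([t for _, _, t in trials])

# Fit MT = a + b * ID; b (s/bit) summarizes how steeply time grows with difficulty.
b, a = np.polyfit(ids, mts, 1)
print(f"a = {a:.3f} s, b = {b:.3f} s/bit")
```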


The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz

arXiv.org Artificial Intelligence

This research introduces a novel evaluation framework designed to assess large language models' (LLMs) ability to acknowledge uncertainty on 675 fundamentally unsolvable problems. Using a curated dataset of graduate-level grand challenge questions with intentionally unknowable answers, we evaluated twelve state-of-the-art LLMs, including both open- and closed-source models, on their propensity to admit ignorance rather than generate plausible but incorrect responses. The best models scored in the 62-68% accuracy range for admitting that the problem solution was unknown, in fields ranging from biology to philosophy and mathematics. We observed an inverse relationship between problem difficulty and model accuracy, with GPT-4 demonstrating higher rates of uncertainty acknowledgment on more challenging problems (35.8%) than on simpler ones (20.0%). This pattern indicates that models may be more prone to generating speculative answers when problems appear more tractable. The study also revealed significant variations across problem categories, with models showing difficulty in acknowledging uncertainty on invention and NP-hard problems while performing relatively better on philosophical and psychological challenges. These results contribute to the growing body of research on artificial general intelligence (AGI) assessment by highlighting the importance of uncertainty recognition as a critical component of future machine intelligence evaluation. This impossibility test thus extends previous theoretical frameworks for universal intelligence testing by providing empirical evidence of current limitations in LLMs' ability to recognize their own knowledge boundaries, suggesting new directions for improving model training architectures and evaluation approaches.
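
Scoring "uncertainty acknowledgment" on unsolvable questions amounts to checking whether a response admits ignorance rather than offering a confident guess. A minimal keyword-based scorer is sketched below as an illustration of the idea; it is not the paper's grading protocol, and the phrase list is an assumption.

```python
# Minimal sketch of scoring uncertainty acknowledgment: count a response as
# correct if it admits the answer is unknown. Illustrative only; not the
# paper's grading protocol, and the phrase list is a rough assumption.
UNCERTAINTY_PHRASES = (
    "unknown", "not known", "cannot be determined", "no one knows",
    "remains unsolved", "i don't know", "i do not know",
)

def acknowledges_uncertainty(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in UNCERTAINTY_PHRASES)

def uncertainty_accuracy(responses):
    """Fraction of responses that admit ignorance on unsolvable questions."""
    return sum(acknowledges_uncertainty(r) for r in responses) / len(responses)

responses = [
    "The answer is currently unknown and remains an open problem.",
    "The answer is 42.",  # confident but unfounded
]
print(uncertainty_accuracy(responses))  # 0.5
```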


On a measure of intelligence

arXiv.org Artificial Intelligence

"The measure of intelligence is the ability to change."

The Fall 2024 Logic in Computer Science column of the Bulletin of EATCS is a little discussion on intelligence, measuring intelligence, and related issues, provoked by the fascinating must-read article "On the measure of intelligence" by François Chollet. The discussion includes a modicum of critique of the article.

Q: Is it about psychology?
A: Chollet is a prominent figure in AI.
Q: We spoke about AI last spring. But you didn't seem to be interested in AI before that.
A: This is largely correct, though I read Norbert Wiener's "Cybernetics" [18] when it was translated into Russian in 1968, and was taken with it. For a while I tried to follow cybernetics developments, at least in the USSR.